Abstract:Zero-shot time series forecasting aims to predict future values for previously unseen series, requiring models to generalize temporal dynamics beyond the training distribution. While recent foundation models achieve strong in-domain performance through large-scale pretraining, their effectiveness often relies on broad data coverage and implicit pattern memorization, which can limit generalization when data are scarce or source and target domains are disjoint. In this work, we propose FSA, a feature-to-strategy framework for controlled zero-shot univariate forecasting. Instead of directly modeling raw sequences in the observation space, FSA learns a structured mapping from an interpretable feature space to an autoregressive strategy space. This design introduces explicit inductive biases that disentangle global trends, periodic components, and local temporal dynamics, enabling the model to capture transferable time-series structure with fewer data assumptions. Empirical results show that, under identical pretraining data, training protocol, and comparable parameter budgets, FSA outperforms Transformer-based architectures in our controlled zero-shot setting.
Abstract:While multimodal large language models (MLLMs) have achieved rapid progress in vision-language understanding, they remain prone to multimodal hallucinations, producing responses that are inconsistent with the visual input. Existing benchmarks predominantly focus on detecting hallucination outcomes rather than evaluating the underlying causes of these failures. Moreover, many benchmarks rely on simplistic scenarios and limited evaluation formats that no longer challenge state-of-the-art models. To address these limitations, we introduce ReactBench, a cause-driven hallucination benchmark featuring multiple tasks and an exam-style evaluation format. By generating adversarial images and hallucination-inducing queries, ReactBench introduces four targeted tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These tasks systematically expose co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond standard accuracy-based evaluation, we leverage Chain-of-Thought reasoning to identify fine-grained sub-causes of hallucination within each task. Extensive evaluations reveal that current MLLMs remain notably vulnerable to cause-specific hallucination triggers, demonstrating the value of ReactBench as a systematic and interpretable testbed for diagnosing and improving multimodal model robustness. The project page is available at https://reactbench.github.io/.
Abstract:Digital audio broadcasting plus (DAB+) is an attractive illuminator for passive radar because it provides persistent, high-power, and geographically widespread very high frequency (VHF) orthogonal frequency-division multiplexing (OFDM) signals. A channel state information (CSI) sensing approach can convert a single received DAB+ stream into a CSI sequence for radar sensing, avoiding the separately received reference signal that conventional passive radars require. However, CSI estimation in DAB+ is challenging because the communication symbols are differentially encoded across time: a wrong symbol transition estimate leads to a persistent multiplicative error in the subsequent CSI estimates within a DAB+ frame. This paper formulates single-stream DAB+ passive radar as a posterior-probability-aware differential CSI tracking problem. The proposed method uses the previously tracked CSI as a channel prior, performs prediction-aided maximum a posteriori detection of the current symbol, converts posterior transition reliability into observation uncertainty, and applies linear minimum mean squared error fusion to obtain a stable tracked CSI. A reliability-informed CSI fusion strategy is also introduced to preserve weak target information. Theoretical analysis is provided, showing guaranteed performance gains in symbol and CSI estimation. Simulation results show that the proposed method can reduce CSI estimation error by over 15 dB compared with prior art. It also improves the median target-to-background ratio by more than 11 dB in random fading scenes. Experiments in Sydney, Australia demonstrate improved range-Doppler maps for commercial aircraft sensing.
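The error-propagation claim in the abstract above can be illustrated with a small sketch. It is not the paper's method: it assumes a noise-free static channel, a simplified four-point transition alphabet instead of the actual DAB+ modulation, and hypothetical names (`csi_from_transitions`, `d_hat`, `m`). It only shows why one wrong transition estimate scales every later CSI estimate by the same complex factor.

```python
import numpy as np

def csi_from_transitions(y, s0, transitions_hat):
    """Rebuild symbols from estimated transitions, then divide them out of y."""
    s_hat = s0 * np.cumprod(transitions_hat)  # differential decoding chain
    return y / s_hat

rng = np.random.default_rng(0)
alphabet = np.array([1, 1j, -1, -1j])  # assumed, simplified transition set
n = 8
h = 0.5 * np.exp(1j * 0.3)             # assumed static channel coefficient
s0 = 1 + 0j                            # known reference symbol

d = rng.choice(alphabet, size=n)       # true symbol transitions
s = s0 * np.cumprod(d)                 # differentially encoded symbols
y = h * s                              # noise-free received samples

d_hat = d.copy()
m = 3
d_hat[m] = d[m] * 1j                   # a single wrong transition estimate

h_hat = csi_from_transitions(y, s0, d_hat)
ratio = h_hat / h                      # 1 before index m, a constant factor after
print(np.round(ratio, 3))
```

Before index `m` the ratio is 1; from `m` onward every estimate carries the same factor `1 / 1j`, which is the persistent multiplicative error the abstract describes.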
Abstract:Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to capture the hierarchical and interconnected nature of clinical medical knowledge, limiting the models' ability to perform fine-grained recognition and complex reasoning. In this paper, we propose a novel Entity-Centric Medical Data Engineering framework. We automatically extract entities from authoritative medical literature to construct a Medical Entity Tree (MET), a hierarchical structure that systematically encodes diseases, anatomical structures, modalities, and symptoms into a unified knowledge repository. Building upon the MET, we propose an advanced data engine that includes: (1) node-guided retrieval to anchor raw data to specific medical concepts, (2) a two-stage hybrid filtering and alignment pipeline to ensure precise visual-semantic correspondence, and (3) knowledge-aware data synthesis to generate enriched captions and targeted reasoning VQA pairs, leveraging structural constraints. Extensive evaluations across six medical benchmarks demonstrate that our approach significantly enhances the medical capabilities of general-purpose MLLMs, improving their ability to handle complex clinical queries and achieve state-of-the-art performance in diverse medical contexts.
Abstract:Text-attributed graphs (TAGs) enhance graph learning by integrating rich textual semantics and topological context for each node. While boosting expressiveness, they also expose new vulnerabilities in graph learning through text-based adversarial surfaces. Recent advances leverage diverse backbones, such as graph neural networks (GNNs) and pre-trained language models (PLMs), to capture both structural and textual information in TAGs. This diversity raises a key question: How can we design universal adversarial attacks that generalize across architectures to assess the security of TAG models? The challenge arises from the stark contrast in how different backbones (GNNs and PLMs) perceive and encode graph patterns, coupled with the fact that many PLMs are only accessible via APIs, limiting attacks to black-box settings. To address this, we propose BadGraph, a novel attack framework that deeply elicits large language models' (LLMs') understanding of general graph knowledge to jointly perturb both node topology and textual semantics. Specifically, we design a target influencer retrieval module that leverages graph priors to construct cross-modally aligned attack shortcuts, thereby enabling efficient LLM-based perturbation reasoning. Experiments show that BadGraph achieves universal and effective attacks across GNN- and LLM-based reasoners, with up to a 76.3% performance drop, while theoretical and empirical analyses confirm its stealthy yet interpretable nature.
Abstract:In this correspondence, we investigate networked sensing in perceptive mobile networks under a bistatic multi-transmitter single-receiver uplink topology, where multiple user equipments (UEs) transmit signals over orthogonal frequency-division multiple access (OFDMA) resources and a single base station performs joint sensing. Uplink clock asynchronism introduces offsets that destroy inter-packet coherence and hinder high-resolution sensing, while multi-user observations exhibit exploitable cross-user correlation. We therefore formulate an asynchronous multi-user uplink OFDMA sensing model and exploit common delay-cluster sparsity across UEs. A line-of-sight (LoS)-referenced calibration first suppresses the offsets, after which a shared-private delay-domain sparse Bayesian learning (SBL) model is used for delay support recovery and user grouping. Doppler and angle of arrival are then estimated from temporal and spatial phase differences. Simulation results show that the proposed scheme outperforms per-user processing, particularly under limited subcarrier budgets and in low signal-to-noise ratio (SNR) regimes.
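As a rough illustration of estimating Doppler and angle of arrival from temporal and spatial phase differences, the sketch below uses the standard phase-difference relations for a single path. It is not the estimator from the paper: the function names, the uniform linear array, the half-wavelength spacing, and the sign conventions are all assumptions made for the example.

```python
import numpy as np

def doppler_from_phase(x, T):
    """Doppler (Hz) from the phase step between consecutive samples spaced T seconds apart."""
    dphi = np.angle(np.sum(x[1:] * np.conj(x[:-1])))
    return dphi / (2 * np.pi * T)

def aoa_from_phase(a, d_over_lambda):
    """AoA (rad) from the phase step between adjacent elements of a uniform linear array."""
    dphi = np.angle(np.sum(a[1:] * np.conj(a[:-1])))
    return np.arcsin(dphi / (2 * np.pi * d_over_lambda))

# Synthetic single-path example (all values are assumed for illustration).
T = 1e-3                 # packet interval in seconds
f_true = 50.0            # Doppler shift in Hz
theta_true = 0.3         # angle of arrival in radians
d_over_lambda = 0.5      # antenna spacing in wavelengths

n = np.arange(32)
x = np.exp(1j * 2 * np.pi * f_true * n * T)                      # temporal samples
k = np.arange(4)
a = np.exp(1j * 2 * np.pi * d_over_lambda * np.sin(theta_true) * k)  # spatial samples

f_est = doppler_from_phase(x, T)
theta_est = aoa_from_phase(a, d_over_lambda)
print(f_est, theta_est)
```

Both estimates are unambiguous only while the per-step phase stays within (-π, π], which bounds the measurable Doppler by 1/(2T) and requires spacing of at most half a wavelength for the full angular range.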
Abstract:Learning motion priors for physics-based humanoid control is an active research topic. Existing approaches mainly include variational autoencoders (VAEs) and adversarial motion priors (AMP). VAEs introduce information loss, and random latent sampling may sometimes produce invalid behaviors. AMP suffers from mode collapse and struggles to capture diverse motion skills. We present the Spherical Latent Motion Prior (SLMP), a two-stage method for learning motion priors. In the first stage, we train a high-quality motion tracking controller. In the second stage, we distill the tracking controller into a spherical latent space. A combination of distillation, a discriminator, and a discriminator-guided local semantic consistency constraint shapes a structured latent action space, allowing stable random sampling without information loss. To evaluate SLMP, we collect a two-hour human combat motion capture dataset and show that SLMP preserves fine motion detail and that random sampling yields semantically valid and stable behaviors. When applied to a two-agent physics-based combat task, SLMP produces human-like and physically plausible combat behaviors using only simple rule-based rewards. Furthermore, SLMP generalizes across different humanoid robot morphologies, demonstrating its transferability beyond a single simulated avatar.
Abstract:Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of a dataset limits the performance ceiling of the trained control policies. Additionally, acquiring high-quality data through professional motion capture systems is costly, which makes it difficult to scale. To address these issues, we propose a closed-loop framework for automated motion data generation and iteration. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework iterates the difficulty of both policies and data using physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to validate the design and advantages of our framework.
Abstract:We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
Abstract:The Received Signal Strength Indicator (RSSI) is widely available on commodity WiFi devices but is commonly regarded as too coarse for fine-grained sensing. This paper revisits its sensing potential and presents WiRSSI, a bistatic WiFi sensing framework for passive human tracking using only RSSI measurements. WiRSSI adopts a 1Tx-3Rx configuration and is readily extensible to Multiple-Input Multiple-Output (MIMO) deployments. We first reveal how CSI power implicitly encodes phase-related information and how this relationship carries over to RSSI, showing that RSSI preserves exploitable Doppler, Angle-of-Arrival (AoA), and delay cues associated with human motion. WiRSSI then extracts Doppler-AoA features via a 2D Fast Fourier Transform and infers delay from amplitude-only information in the absence of subcarrier-level phase. The estimated AoA and delay are then mapped to Cartesian coordinates and denoised to recover motion trajectories. Experiments in practical environments show that WiRSSI achieves median XY localization errors of 0.905 m, 0.784 m, and 0.785 m for elliptical, linear, and rectangular trajectories, respectively. In comparison, a representative CSI-based method attains median errors of 0.574 m, 0.599 m, and 0.514 m, corresponding to an average accuracy gap of 0.26 m. These results demonstrate that, despite its lower resolution, RSSI can support practical passive sensing and offers a low-cost alternative to CSI-based WiFi sensing.
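The reported average accuracy gap can be checked directly from the six median errors quoted in the abstract above. The snippet below is only that arithmetic; the variable names are ours.

```python
# Median XY localization errors in metres, as quoted in the abstract
# (elliptical, linear, rectangular trajectories).
rssi_errors_m = [0.905, 0.784, 0.785]   # WiRSSI
csi_errors_m = [0.574, 0.599, 0.514]    # representative CSI-based method

mean_rssi = sum(rssi_errors_m) / len(rssi_errors_m)
mean_csi = sum(csi_errors_m) / len(csi_errors_m)
gap = mean_rssi - mean_csi

print(f"mean RSSI error: {mean_rssi:.3f} m")
print(f"mean CSI error:  {mean_csi:.3f} m")
print(f"average gap:     {gap:.2f} m")  # 0.26
```

The per-trajectory gaps are not uniform (about 0.33 m, 0.19 m, and 0.27 m), so the 0.26 m figure is a mean across the three trajectory shapes.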